Abstract

We aim to design strategies for sequential decision making that adjust to the difficulty of the learning problem. We 
study this question both in the setting of prediction with expert advice, and for more general combinatorial decision 
tasks. We are not satisfied with just guaranteeing minimax regret rates, but we want our algorithms to perform 
significantly better on easy data. Two popular ways to formalize such adaptivity are second-order regret bounds and 
quantile bounds. The underlying notions of ‘easy data’, which may be paraphrased as “the learning problem has small 
variance” and “multiple decisions are useful”, are synergetic. But even though there are sophisticated algorithms that 
exploit one of the two, no existing algorithm is able to adapt to both. 

In this paper we outline a new method for obtaining such adaptive algorithms, based on a potential function that 
aggregates a range of learning rates (which are essential tuning parameters). By choosing the right prior we construct 
efficient algorithms and show that they reap both benefits by proving the first bounds that are both second-order and 
incorporate quantiles. 

Keywords: Online learning, prediction with expert advice, combinatorial prediction, easy data 


1 Introduction 


We study the design of adaptive algorithms for online learning [Cesa-Bianchi and Lugosi, 2006]. Our work starts in
the hedge setting [Freund and Schapire, 1997], a core instance of prediction with expert advice [Vovk, 1990, 1998,
Littlestone and Warmuth, 1994] and online convex optimization [Shalev-Shwartz, 2011]. Each round $t = 1, 2, \ldots$ the
learner plays a probability vector $w_t$ on $K$ experts, the environment assigns a bounded loss to each expert in the form
of a vector $\ell_t \in [0,1]^K$, and the learner incurs loss given by the dot product $w_t^\top \ell_t$. The learner's goal is to perform
almost as well as the best expert, without making any assumptions about the genesis of the losses. Specifically, the
learner's performance compared to expert $k$ is $r_t^k = w_t^\top \ell_t - \ell_t^k$, and after any number of rounds $T$ the goal is to have
small regret $R_T^k = \sum_{t=1}^T r_t^k$ with respect to every expert $k$.


The Hedge algorithm by Freund and Schapire [1997] ensures

$$R_T^k \prec \sqrt{T \ln K} \quad \text{for each expert } k \qquad \text{(classic)} \quad (1)$$

(with $\prec$ denoting moral inequality, i.e. suppressing details inappropriate for this introduction), which is tight for
adversarial (worst-case) losses [Cesa-Bianchi and Lugosi, 2006]. Yet one can ask whether the worst case is also the common
case, and indeed two lines of research show that this bound can be improved greatly in various important scenarios.
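To make the protocol and the classic bound (1) concrete, here is a minimal sketch of Hedge with a fixed learning rate. The function name `hedge`, the random loss sequence, and the tuning $\eta = \sqrt{8 \ln K / T}$ (the textbook choice for losses in $[0,1]$, which gives regret at most $\sqrt{(T/2) \ln K}$) are illustrative assumptions, not taken from this paper.

```python
import math
import random

def hedge(losses, eta):
    """Run Hedge (exponential weights) with a fixed learning rate eta.

    losses: list of per-round loss vectors, each with K entries in [0, 1].
    Returns the regret R_T^k of the learner against every expert k.
    """
    K = len(losses[0])
    cum = [0.0] * K      # cumulative loss of each expert
    learner = 0.0        # cumulative loss of the learner
    for loss in losses:
        m = min(cum)     # shift exponents for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum]
        total = sum(w)
        w = [x / total for x in w]
        learner += sum(wk * lk for wk, lk in zip(w, loss))
        cum = [c + l for c, l in zip(cum, loss)]
    return [learner - c for c in cum]

random.seed(0)
T, K = 1000, 8
losses = [[random.random() for _ in range(K)] for _ in range(T)]
eta = math.sqrt(8 * math.log(K) / T)       # standard worst-case tuning (assumed)
regret = hedge(losses, eta)
bound = math.sqrt(T * math.log(K) / 2)     # worst-case guarantee for this tuning
```

The guarantee holds for every loss sequence, but it never improves on easy data, which is the gap the second-order and quantile bounds below address.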


The first line of approaches [Cesa-Bianchi et al., 2007, Hazan and Kale, 2010, Chiang et al., 2012, De Rooij et al.,
2014, Gaillard et al., 2014] obtains

$$R_T^k \prec \sqrt{V_T^k \ln K} \quad \text{for each expert } k. \qquad \text{(second order)}$$

That is, $T$ can be reduced to $V_T^k$, which stands for some (there are various) kind of cumulative variance or related
second-order quantity. This variance is then often shown to be small, $V_T^k \ll T$, in important regimes like stochastic
data (where it is typically bounded). The second line, independently in parallel, shows how to reduce the dependence
on the number of experts $K$ whenever multiple experts perform well. This is expected to occur, for example, when
experts are constructed by finely discretising the parameters of a (probabilistic) model, or when learning sub-algorithms
are used as experts. The resulting so-called quantile bounds (see Chaudhuri et al., 2009, Chernov and Vovk, 2010,
Luo and Schapire, 2014) of the form

$$\min_{k \in \mathcal{K}} R_T^k \prec \sqrt{T\,(-\ln \pi(\mathcal{K}))} \quad \text{for each subset } \mathcal{K} \text{ of experts} \qquad \text{(quantile)}$$

improve $K$ to the reciprocal of the combined prior mass $\pi(\mathcal{K})$, at the cost of now comparing to the worst expert among
$\mathcal{K}$, so intuitively the best guarantee is obtained for $\mathcal{K}$ the set of "sufficiently good" experts. (There is no requirement
that the prior $\pi(k)$ is uniform, and consequently quantile bounds imply the closely related bounds with non-uniform
priors, studied e.g. by Hutter and Poland [2005].) As these two types of improvements are complementary, we would
like to combine them in a single algorithm. However, the mentioned two approaches are based on incompatible
techniques, which until now have refused to coexist.


First Contribution We develop a new method, based on priors on a parameter called the learning rate and on experts,
to derive algorithms and prove bounds that combine quantile and variance guarantees. Our new prediction strategy,
called Squint, has regret at most

$$R_T^{\mathcal{K}} \prec \sqrt{V_T^{\mathcal{K}} \left(-\ln \pi(\mathcal{K}) + C_T\right)} \quad \text{for each subset } \mathcal{K} \text{ of experts}, \qquad (2)$$

where $R_T^{\mathcal{K}} = \mathbb{E}_{\pi(k|\mathcal{K})} R_T^k$ and $V_T^{\mathcal{K}} = \mathbb{E}_{\pi(k|\mathcal{K})} V_T^k$ denote the average (under the prior) among the reference experts
$k \in \mathcal{K}$ of the regret $R_T^k = \sum_{t=1}^T r_t^k$ and the (uncentered) variance of the excess losses $V_T^k = \sum_{t=1}^T (r_t^k)^2$. The
overhead $C_T$ for learning the optimal learning rate is specified below.

As is common for this type of results, our variance factor $V_T^{\mathcal{K}}$ is opaque in that it depends on the algorithm as
well as the data, yet our bound does imply small regret in several important cases. For example, we immediately see
that variance and hence regret stop accumulating whenever the weights concentrate, as will happen when one expert
is clearly better than all the others. (The expert loss variance of Hazan and Kale [2010] does depend only on the data,
but unfortunately may grow linearly even when the best expert is obvious.) Furthermore, Gaillard et al. [2014] show
that second-order bounds like (2) imply small regret over experts with small losses ($L^*$-bounds) and bounded regret
both in expectation and with high probability in stochastic settings with a unique best expert.

We will instantiate our scheme three times, varying the prior distribution of the learning rate, to obtain three
interesting bounds. First, for the uniform prior, we obtain an efficient algorithm with $C_T = \ln V_T^{\mathcal{K}}$. Then we consider
a prior that we call the CV prior, because it was introduced by Chernov and Vovk [2010] (to get quantile bounds),
and we improve the bound to $C_T = \ln \ln V_T^{\mathcal{K}}$. As we consider $\ln(\ln(x))$ to be essentially constant, this algorithm
achieves our goal of combining the benefits of second-order bounds with quantiles, but unfortunately it does not have
an efficient implementation. Finally, by considering the improper(!) log-uniform prior, we get the best of both worlds:
an algorithm that is both efficient and achieves our goal with $C_T = \ln \ln T$. The efficient algorithms for the uniform
and the log-uniform prior both perform just $K$ operations per round, and are hence as widely applicable as vanilla
Hedge.


Combinatorial games We then consider a more sophisticated setting, where instead of experts $k \in \{1, \ldots, K\}$,
the elementary actions are combinatorial concepts from some class $\mathcal{C} \subseteq \{0,1\}^K$. Many theoretically interesting and
important real-world online decision problems are of this form, for example subsets (sub-problem of Principal
Component Analysis), [...] graph (routing), etc. (see for instance Takimoto and Warmuth, 2003, Kalai and Vempala, 2005,
Warmuth and Kuzmin, 2008, Helmbold and Warmuth, 2009, Cesa-Bianchi and Lugosi, 2012, Warmuth et al., 2014,
Audibert et al., 2014). The combinatorial structure is reflected in the loss, which decomposes into a sum of coordinate
losses. That is, the loss of concept $c \in \mathcal{C}$ is $c^\top \ell$ for some loss vector $\ell \in [0,1]^K$. This is natural: for example the loss
of a path is the total loss of its edges. Koolen et al. [2010] develop Component Hedge (of the Mirror Descent family),
with regret at most

$$R_T^c \prec \sqrt{T K\, \mathrm{comp}(\mathcal{C})} \quad \text{for each concept } c \in \mathcal{C}, \qquad (3)$$

where $\mathrm{comp}(\mathcal{C})$, the analog of $\ln K$ for experts, measures the complexity (entropy) of the combinatorial class $\mathcal{C}$.
Luo and Schapire [2014] derive $\sqrt{T}$ quantile bounds for online convex optimization. No second-order
quantile methods are known for combinatorial games.


Second Contribution We extend our method to combinatorial games and obtain algorithms with second-order
quantile regret bounds. Our new predictor Component iProd keeps the regret below

$$R_T^v \prec \sqrt{V_T^v \left(\mathrm{comp}(v) + K C_T\right)} \quad \text{for each } v \in \mathrm{conv}(\mathcal{C}). \qquad (4)$$

In the combinatorial domain the role of the reference set of experts $\mathcal{K}$ is subsumed by an "average concept" vector
$v \in \mathrm{conv}(\mathcal{C})$, for which our bound relates the coordinate-wise average regret $R_T^v = \sum_{t,k} v_k r_t^k$ to the averaged
variance $V_T^v = \sum_{t,k} v_k (r_t^k)^2$ and the prior entropy $\mathrm{comp}(v)$.

Even if we disregard computational efficiency, our bound (4) is not a straightforward consequence of the experts
bound (2) applied with one expert for each concept, paralleling the fact that (3) does not follow from (1). The reason
is that we would obtain a bound with per-concept variance $\sum_t \left(\sum_k v_k r_t^k\right)^2$ instead, which can overshoot even the
straight-up worst-case bound (3) by a $\sqrt{K}$ factor (Koolen et al. [2010] call this the range factor problem). To avoid
this problem, our method is "collapsed" (like Component Hedge): it only maintains first and second order statistics
about the $K$ coordinates separately, not about concepts as a whole.


1.1 Related work 

Obtaining bounds for easy data in the experts setting is typically achieved by adaptively tuning a learning rate, which
is a parameter found in many algorithms. Schemes for choosing the learning rate on-line are built by Auer et al. [2002],
Cesa-Bianchi et al. [2007], Hazan and Kale [2010], De Rooij et al. [2014], Gaillard et al. [2014], Wintenberger [2014].
These schemes typically choose a monotonically decreasing sequence of learning rates to prove a certain regret bound.

Other approaches try to aggregate multiple learning rates. The motivations and techniques here show extreme
diversity, ranging from drifting games by Chaudhuri et al. [2009], Luo and Schapire [2014], and defensive forecasting by
Chernov and Vovk [2010] to minimax relaxations by Rakhlin et al. [2013] and budgeted timesharing by Koolen et al.
[2014]. The last scheme is of note, as it does not aggregate to reproduce a bound of a certain form, but rather to
compete with the optimally tuned learning rate for the Hedge algorithm.


Outline We introduce the Squint prediction rule for experts in Section 2. In Section 3 we motivate three choices
for the prior on the learning rate, discuss the resulting algorithms and prove second-order quantile regret bounds. In
Section 4 we extend Squint to combinatorial prediction tasks. We conclude with open problems in Section 5.

2 Squint: a Second-order Quantile Method for Experts

Let us review the expert setting protocol to fix notation. In round $t$ the algorithm plays a probability distribution $w_t$
on $K$ experts and encounters loss $\ell_t \in [0,1]^K$. The instantaneous regret of the algorithm compared to expert $k$ is
$r_t^k = w_t^\top \ell_t - \ell_t^k = (w_t - e_k)^\top \ell_t$, where $e_k$ is the unit vector in direction $k \in \{1, \ldots, K\}$. Let $R_T^k = \sum_{t=1}^T r_t^k$ be the
total regret compared to expert $k$ and let $V_T^k = \sum_{t=1}^T (r_t^k)^2$ be the cumulative uncentered variance of the instantaneous
regrets.

The central building block of our approach is a potential function $\Phi$ that maps sequences of instantaneous regret
vectors $r_{1:T} = (r_1, \ldots, r_T)$ of any length $T \ge 0$ to numbers. Potential functions are staple online learning tools
[Cesa-Bianchi and Lugosi, 2006, Abernethy et al., 2014]. We advance the following schema, which we call Squint
(for second-order quantile integral). It consists of the potential function and associated prediction rule

$$\Phi(r_{1:T}) = \mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_T^k - \eta^2 V_T^k} - 1\right], \qquad w_{T+1} = \frac{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_T^k - \eta^2 V_T^k}\, \eta\, e_k\right]}{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_T^k - \eta^2 V_T^k}\, \eta\right]}, \qquad (5)$$
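where the expectation is taken under prior distributions $\pi$ on experts $k \in \{1, \ldots, K\}$ and $\gamma$ on learning rates
$\eta \in [0, 1/2]$ that are parameters of Squint. We will see in a moment that Squint ensures that the potential remains
non-positive, $\Phi(r_{1:T}) \le 0$. Let us first investigate why non-positivity is desirable. To gain a quick-and-dirty
appreciation for this, suppose that $\mathcal{K} \subseteq \{1, \ldots, K\}$ is the reference set of experts with good performance. Let us abbreviate
their average regret and variance to $R = R_T^{\mathcal{K}} = \mathbb{E}_{\pi(k|\mathcal{K})} R_T^k$ and $V = V_T^{\mathcal{K}} = \mathbb{E}_{\pi(k|\mathcal{K})} V_T^k$. Furthermore, imagine for
simplicity that the prior $\gamma$ puts all its mass on learning rate $\hat\eta = \frac{R}{2V}$. Now non-positive potential $\Phi(r_{1:T}) \le 0$ implies

$$1 \ge \mathbb{E}_{\pi(k)}\left[e^{\hat\eta R_T^k - \hat\eta^2 V_T^k}\right] \ge \pi(\mathcal{K})\, \mathbb{E}_{\pi(k|\mathcal{K})}\left[e^{\hat\eta R_T^k - \hat\eta^2 V_T^k}\right] \ge \pi(\mathcal{K})\, e^{\hat\eta R - \hat\eta^2 V} = \pi(\mathcal{K})\, e^{\frac{R^2}{4V}}, \qquad (6)$$

where the second inequality drops the experts outside $\mathcal{K}$ and the third is Jensen's inequality. Taking logarithms
immediately yields the desired variance-with-quantiles bound

$$R_T^{\mathcal{K}} \le 2 \sqrt{V_T^{\mathcal{K}} \left(-\ln \pi(\mathcal{K})\right)}. \qquad \text{(Back of envelope, but great promise!)}$$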

This raises the question: how does Squint keep its potential $\Phi$ below zero? By always decreasing it:

Lemma 1. Squint (5) ensures that, for any loss sequence $\ell_1, \ldots, \ell_T$ in $[0,1]^K$,

$$\Phi(r_{1:T}) \le \ldots \le \Phi(\emptyset) = 0.$$

Proof. The key role is played by the upper bound [Cesa-Bianchi and Lugosi, 2006, Lemma 2.4]

$$e^{x - x^2} - 1 \le x \quad \text{for } x \ge -1/2. \qquad (7)$$

Applying this to $\eta r_{T+1}^k \ge -1/2$, we bound the increase $\Phi(r_{1:T+1}) - \Phi(r_{1:T})$ of the potential by

$$\mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_T^k - \eta^2 V_T^k}\left(e^{\eta r_{T+1}^k - (\eta r_{T+1}^k)^2} - 1\right)\right] \le \mathbb{E}_{\pi(k)\gamma(\eta)}\left[e^{\eta R_T^k - \eta^2 V_T^k}\, \eta\, (w_{T+1} - e_k)^\top \ell_{T+1}\right] = 0, \qquad (8)$$

where the last identity holds because the algorithm's weights (5) have been chosen to satisfy it. □


In Section 3 we will make the proof sketch from (6) rigorous. The hunt is on for priors $\gamma$ on $\eta$ that (a) pack plenty
of mass close to $\hat\eta$, wherever it may end up; and (b) admit efficient computation of the weights $w_{T+1}$ by means of a
closed-form formula for its integrals over $\eta$. We conclude this section by putting Squint in context.
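The Squint rule (5) and Lemma 1 can be checked numerically with a short sketch. It assumes a finite prior $\gamma$ on a hypothetical grid of learning rates in $(0, 1/2]$, so that the expectation over $\eta$ becomes a finite sum; the function names and the synthetic losses are illustrative only.

```python
import math
import random

def potential(R, V, prior, etas, gamma):
    """Squint potential (5): E[exp(eta*R^k - eta^2*V^k) - 1] under prior x gamma."""
    return sum(p * g * (math.exp(eta * Rk - eta * eta * Vk) - 1.0)
               for p, Rk, Vk in zip(prior, R, V)
               for eta, g in zip(etas, gamma))

def squint_weights(R, V, prior, etas, gamma):
    """Squint weights (5): expert k gets mass proportional to
    prior[k] * sum_eta gamma(eta) * eta * exp(eta*R^k - eta^2*V^k)."""
    raw = [p * sum(g * eta * math.exp(eta * Rk - eta * eta * Vk)
                   for eta, g in zip(etas, gamma))
           for p, Rk, Vk in zip(prior, R, V)]
    total = sum(raw)
    return [x / total for x in raw]

def run_squint(losses, prior, etas, gamma):
    """Play Squint and record the potential after every round."""
    K = len(prior)
    R, V = [0.0] * K, [0.0] * K
    trace = [potential(R, V, prior, etas, gamma)]
    for loss in losses:
        w = squint_weights(R, V, prior, etas, gamma)
        learner_loss = sum(wk * lk for wk, lk in zip(w, loss))
        for k in range(K):
            r = learner_loss - loss[k]      # instantaneous regret r_t^k
            R[k] += r
            V[k] += r * r
        trace.append(potential(R, V, prior, etas, gamma))
    return R, V, trace

random.seed(1)
K, T = 5, 300
prior = [1.0 / K] * K
etas = [2.0 ** -i for i in range(1, 9)]     # hypothetical grid in (0, 1/2]
gamma = [1.0 / len(etas)] * len(etas)
# Synthetic losses in [0, 1]; expert 0 is slightly better on average.
losses = [[min(1.0, max(0.0, random.random() - (0.2 if k == 0 else 0.0)))
           for k in range(K)] for _ in range(T)]
R, V, trace = run_squint(losses, prior, etas, gamma)
```

The recorded `trace` starts at zero and never increases, as Lemma 1 predicts for any prior supported on $[0, 1/2]$.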


Discussion The Squint potential is a function of the vector $\sum_{t=1}^T r_t$ of cumulative regrets, but also of its sum of
squares, which is essential for second-order bounds. Squint is an anytime algorithm, i.e. it has no built-in dependence
on an eventual time horizon, and its regret bounds hold at any time $T$ of evaluation. In addition Squint is timeless in
the sense of De Rooij et al. [2014], meaning that its predictions (current and future) are unaffected by inserting rounds
with $\ell = 0$.

The Squint potential is an average of exponentiated negative "losses" (derived from the regret) under product
prior $\pi(k)\gamma(\eta)$, reminiscent of the exponential weights analysis potential. Our Squint weights could be viewed as
exponential weights, but, intriguingly, for another prior, with $\gamma(\eta)$ replaced by $\gamma(\eta)\eta$. Mysteriously, playing the latter
controls the former.

The bound (7) is hard-coded in our Squint potential function and algorithm. To instead delay this bound to the
analysis, we might introduce the alternative iProd (for integrated products) scheme

$$\Phi(r_{1:T}) = \mathbb{E}_{\pi(k)\gamma(\eta)}\left[\prod_{t=1}^T (1 + \eta r_t^k) - 1\right], \qquad w_{T+1} = \frac{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[\prod_{t=1}^T (1 + \eta r_t^k)\, \eta\, e_k\right]}{\mathbb{E}_{\pi(k)\gamma(\eta)}\left[\prod_{t=1}^T (1 + \eta r_t^k)\, \eta\right]}. \qquad (9)$$

The iProd weights keep the iProd potential identically zero, above the Squint potential by (7), and Squint's regret
bounds hence transfer to iProd. We champion Squint over the purer iProd because Squint's weights admit efficient
closed form evaluation, as shown in the next section. For $\gamma$ a point-mass on a fixed choice of $\eta$ this advantage
disappears, and iProd reduces to Modified Prod by Gaillard et al. [2014], whereas Squint becomes very similar to the
OBA algorithm of Wintenberger [2014].
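The claim that the iProd weights keep the iProd potential identically zero can be illustrated with a small sketch. As an assumption for the example, $\gamma$ is a finite prior on a hypothetical grid of learning rates, and the products $\prod_t (1 + \eta r_t^k)$ are stored explicitly; all names are illustrative.

```python
import random

def run_iprod(losses, prior, etas, gamma):
    """Play iProd (9) with a finite prior on learning rates.

    Returns the value of the iProd potential after each round.
    """
    K, M = len(prior), len(etas)
    # prod[k][j] holds prod_t (1 + etas[j] * r_t^k) over the rounds seen so far.
    prod = [[1.0] * M for _ in range(K)]
    trace = []
    for loss in losses:
        raw = [prior[k] * sum(gamma[j] * etas[j] * prod[k][j] for j in range(M))
               for k in range(K)]
        total = sum(raw)
        w = [x / total for x in raw]
        learner_loss = sum(wk * lk for wk, lk in zip(w, loss))
        for k in range(K):
            r = learner_loss - loss[k]          # r in [-1, 1], so 1 + eta*r >= 1/2
            for j in range(M):
                prod[k][j] *= 1.0 + etas[j] * r
        phi = sum(prior[k] * gamma[j] * (prod[k][j] - 1.0)
                  for k in range(K) for j in range(M))
        trace.append(phi)
    return trace

random.seed(2)
K, T = 4, 200
prior = [1.0 / K] * K
etas = [2.0 ** -i for i in range(1, 7)]         # hypothetical grid in (0, 1/2]
gamma = [1.0 / len(etas)] * len(etas)
losses = [[random.random() for _ in range(K)] for _ in range(T)]
trace = run_iprod(losses, prior, etas, gamma)
```

Up to floating-point rounding every entry of `trace` is zero, because the weights are chosen so that the expected increment $\mathbb{E}[\prod_t (1 + \eta r_t^k)\, \eta\, r_{T+1}^k]$ vanishes.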
3 Three Choices of the Prior on Learning Rates

We will now consider different choices for the prior $\gamma$ on $\eta \in [0, 1/2]$. In each case the proof of the corresponding
regret bound elaborates on the argument in (6), showing that the priors place sufficient mass in a neighbourhood of
the optimized learning rate $\hat\eta$. This might be viewed as performing a Laplace approximation of the integral over $\eta$,
although the details vary slightly depending on the prior $\gamma$. The prior $\pi$ on experts remains completely general. The
proofs can be found in Appendix A.


3.1 Conjugate Prior 


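First we consider a conjugate prior $\gamma$ with density

$$\frac{d\gamma}{d\eta} = \frac{e^{a\eta - b\eta^2}}{Z(a,b)}, \quad \text{where} \quad Z(a,b) = \int_0^{1/2} e^{a\eta - b\eta^2}\, d\eta, \qquad (10)$$

for parameters $a, b \in \mathbb{R}$. The uniform prior, mentioned in the introduction, corresponds to the special case $a = b = 0$,
for which $Z(a,b) = 1/2$. Abbreviating $x = a + R_T^k$ and $y = b + V_T^k$, the Squint predictions (5) then specialize to
become proportional to

$$w_{T+1}^k \propto \pi(k) \int_0^{1/2} e^{\eta x - \eta^2 y}\, \eta\, d\eta = \pi(k) \left( \frac{e^{\frac{x^2}{4y}} \sqrt{\pi}\, x \left( \mathrm{erf}\left(\frac{x}{2\sqrt{y}}\right) - \mathrm{erf}\left(\frac{x - y}{2\sqrt{y}}\right) \right)}{4 y^{3/2}} + \frac{1 - e^{\frac{x}{2} - \frac{y}{4}}}{2y} \right). \qquad (11)$$

These weights can be computed efficiently (but see Appendix B for numerically stable evaluation). For this prior, we
obtain the following bound: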


Theorem 2 (Conjugate Prior). Let $\ln_+(x) = \ln(\max\{x, 1\})$. Then the regret of Squint (5) with conjugate prior (10)
(with respect to any subset of experts $\mathcal{K}$) is bounded by


Rf < 2 . 


'\ 


! , Z(a,b)Ji{Vf + b) 

( v f+h) ( 2 + ln + ( -- 



5 ln_i 


( Z(a,b) 

\ 7T(/C) 


( 12 ) 


The oracle tuning a 
(ED becomes 


0 and b = Vjr results in Z (a, b ) < 




Plugging this in we find that the main term in 


l2Vf 


+ ln + 


r(/C) 


which is of the form (2) that we are after, with constant overhead $C_T$ for learning the learning rate. Of course, the fact
that we do not know $V_T^{\mathcal{K}}$ in advance makes this tuning impossible, and for any constant parameters $a$ and $b$ we get a
factor of order $C_T = \ln V_T^{\mathcal{K}}$.
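As a sanity check on the closed form in (11), the following sketch evaluates $\int_0^{1/2} \eta\, e^{\eta x - \eta^2 y}\, d\eta$ with the error function and compares it to plain midpoint quadrature. The function names are illustrative, the formula is used naively and assumes $y > 0$, and it does not implement the numerically stable evaluation that the paper defers to Appendix B.

```python
import math

def conjugate_integral(x, y):
    """Closed form of int_0^{1/2} eta * exp(eta*x - eta^2*y) d eta, for y > 0."""
    s = math.sqrt(y)
    gauss = (math.exp(x * x / (4.0 * y)) * math.sqrt(math.pi) * x
             * (math.erf(x / (2.0 * s)) - math.erf((x - y) / (2.0 * s)))
             / (4.0 * y ** 1.5))
    boundary = (1.0 - math.exp(x / 2.0 - y / 4.0)) / (2.0 * y)
    return gauss + boundary

def midpoint_integral(x, y, n=20000):
    """Midpoint-rule approximation of the same integral, for comparison."""
    h = 0.5 / n
    total = 0.0
    for i in range(n):
        eta = (i + 0.5) * h
        total += eta * math.exp(eta * x - eta * eta * y)
    return total * h

def conjugate_weights(R, V, prior, a=0.0, b=1.0):
    """Squint weights for the conjugate prior: x = a + R^k, y = b + V^k."""
    raw = [p * conjugate_integral(a + Rk, b + Vk) for p, Rk, Vk in zip(prior, R, V)]
    total = sum(raw)
    return [w / total for w in raw]

weights = conjugate_weights([1.5, -0.5, 0.0], [2.0, 1.0, 0.5], [0.2, 0.3, 0.5])
```

For large $x^2 / (4y)$ the factor $e^{x^2/(4y)}$ overflows while the difference of error functions underflows, which is why a stable evaluation needs more care than this direct transcription.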


3.2 A Good Prior in Theory 

The reason the conjugate prior does not achieve the optimal bound is that it does not put sufficient mass in a
neighbourhood of the optimal learning rate $\hat\eta$ that maximizes $e^{\eta R_T^{\mathcal{K}} - \eta^2 V_T^{\mathcal{K}}}$. To see how we could address this issue, observe
that we can plug $\alpha\hat\eta$ instead of $\hat\eta$ into (6) for some scaling factor $\alpha \in (0,1)$, and still obtain the desired regret bound
up to a constant factor (which depends on $\alpha$). This implies that, if we could find a prior that puts a constant amount of
mass on the interval $[\alpha\hat\eta, \hat\eta]$, independent of $\hat\eta$, then we would only pay a constant cost $C_T$ to learn the learning rate, at
the price of having a slightly worse constant factor.

A prior that puts constant mass on any interval $[\alpha\hat\eta, \hat\eta]$ should have a distribution function of the form $a \ln(\eta) + b$
for some constants $a$ and $b$, and hence its density should be proportional to $1/\eta$. But here we run into a problem,
because, unfortunately, $1/\eta$ does not have a finite integral over $\eta \in [0, 1/2]$ and hence no such prior exists!
The solution we adopt in this section will be to adjust the density $1/\eta$ just a tiny amount so that it does integrate.
Let $\gamma$ have density

$$\frac{d\gamma}{d\eta} = \frac{\ln 2}{\eta \ln^2(\eta)}, \qquad (13)$$

where $\ln^2(x) = (\ln(x))^2$. We call this the CV prior, because it has previously been used to get quantile bounds by
Chernov and Vovk [2010]. The additional factor $1/\ln^2(\eta)$ in the prior only leads to an extra factor of $\sqrt{\ln \ln V_T^{\mathcal{K}}}$ in
the bound, which we consider to be essentially a constant.

Although the motivation above suggests that we might obtain a suboptimal constant factor (depending on $\alpha$), a
more careful analysis shows that this does not even happen: apart from the effect of the $1/\ln^2(\eta)$ term in the prior, we
obtain the optimal multiplicative constant.
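A short calculation makes the "tiny adjustment" tangible. Integrating the density $\ln 2 / (\eta \ln^2 \eta)$ over $(0, \eta]$ gives the distribution function $\ln 2 / (-\ln \eta)$, which equals 1 at $\eta = 1/2$. The sketch below (function names are illustrative) uses it to compute the mass on an interval $[\alpha\hat\eta, \hat\eta]$.

```python
import math

def cv_cdf(eta):
    """Mass the CV prior puts on (0, eta], for 0 < eta <= 1/2."""
    return math.log(2.0) / -math.log(eta)

def cv_interval_mass(eta_hat, alpha):
    """Mass of the CV prior on the interval [alpha * eta_hat, eta_hat]."""
    return cv_cdf(eta_hat) - cv_cdf(alpha * eta_hat)

# Mass on [eta_hat / 2, eta_hat] for eta_hat = 2^-i: this equals 1 / (i * (i + 1)).
masses = [cv_interval_mass(2.0 ** -i, 0.5) for i in (1, 4, 16, 64)]
```

The mass on $[\hat\eta/2, \hat\eta]$ is not constant but shrinks like $1/\log_2^2(1/\hat\eta)$, so its negative logarithm grows only doubly logarithmically as $\hat\eta$ shrinks, in line with the $\ln \ln$ overhead discussed above.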


Theorem 3 (CV Prior). Let $\ln_+(x) = \ln(\max\{x, 1\})$. Then the regret of Squint (5) with CV prior (13) (with respect
to any subset of experts $\mathcal{K}$) is bounded by



( 



\1 

R% < \hvS- 

1 + 

\ 

2 ln+ 

+ V 2-V2 ) 



7 r(/C) ln(2) 



l \ 


V 

n 


— 5 In7r(/C) + 4. 


(14) 


3.3 Improper Prior 

In the last section we argued that we needed a density proportional to $1/\eta$ on $\eta \in [0, 1/2]$. Such a density would not
integrate, and we studied the CV prior density instead. However, we could be bold and see what breaks if we use the
improper $1/\eta$ density anyway. We should be highly suspicious though, because this density is improper of the worst
kind: the integral $\int_0^{1/2} e^{\eta R_T^k - \eta^2 V_T^k} \frac{1}{\eta}\, d\eta$ diverges no matter how many rounds of data we process (a Bayesian would
say: "the posterior remains improper"). Yet it turns out that we hit no essential impossibilities: the improper prior $1/\eta$
cancels with the $\eta$ present in the Squint rule (5), and the predictions are always well-defined. As we will see, we still
get desirable regret bounds, but, equally important, we regain a closed-form integral for our weight computation. The
Squint prediction (5) specializes to

$$w_{T+1}^k \propto \pi(k) \int_0^{1/2} e^{\eta R_T^k - \eta^2 V_T^k}\, d\eta = \pi(k)\, \frac{\sqrt{\pi}\, e^{\frac{(R_T^k)^2}{4 V_T^k}} \left( \mathrm{erf}\left(\frac{R_T^k}{2\sqrt{V_T^k}}\right) - \mathrm{erf}\left(\frac{R_T^k - V_T^k}{2\sqrt{V_T^k}}\right) \right)}{2\sqrt{V_T^k}}. \qquad (15)$$

(We look at numerical stability in Appendix B.) This strategy provides the following guarantee:


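Theorem 4 (Improper Prior). The regret of Squint with improper prior (15) (with respect to any subset of experts $\mathcal{K}$)
is bounded by

$$R_T^{\mathcal{K}} \le \sqrt{2 V_T^{\mathcal{K}}} \left( 1 + \sqrt{2 \ln\left( \frac{\frac{1}{2} + \ln(T+1)}{\pi(\mathcal{K})} \right)} \right) + 5 \ln\left( 1 + \frac{1 + 2\ln(T+1)}{\pi(\mathcal{K})} \right). \qquad (16)$$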
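The weight computation for the improper prior is short enough to sketch in full. The code below evaluates $\int_0^{1/2} e^{\eta R - \eta^2 V}\, d\eta$ via the error function, handles the first round (where $R = V = 0$) separately, and checks the closed form against midpoint quadrature. Function names are illustrative, and this direct transcription is an assumption-laden sketch that ignores the numerical stability issues deferred to Appendix B.

```python
import math

def improper_factor(R, V):
    """int_0^{1/2} exp(eta*R - eta^2*V) d eta: the unnormalised weight factor."""
    if V <= 0.0:
        # No variance accumulated: the integrand reduces to exp(eta * R).
        return 0.5 if R == 0.0 else (math.exp(R / 2.0) - 1.0) / R
    s = 2.0 * math.sqrt(V)
    return (math.sqrt(math.pi) * math.exp(R * R / (4.0 * V))
            * (math.erf(R / s) - math.erf((R - V) / s)) / s)

def squint_improper_weights(R, V, prior):
    """Squint weights with the improper 1/eta prior on learning rates."""
    raw = [p * improper_factor(Rk, Vk) for p, Rk, Vk in zip(prior, R, V)]
    total = sum(raw)
    return [x / total for x in raw]

def midpoint(R, V, n=20000):
    """Midpoint-rule approximation of the same integral, for comparison."""
    h = 0.5 / n
    return h * sum(math.exp((i + 0.5) * h * R - ((i + 0.5) * h) ** 2 * V)
                   for i in range(n))

first_round = squint_improper_weights([0.0, 0.0], [0.0, 0.0], [0.3, 0.7])
later_round = squint_improper_weights([3.0, -2.0], [8.0, 3.0], [0.3, 0.7])
```

In the first round the weights equal the prior on experts; afterwards mass shifts towards experts with large regret $R_T^k$ relative to their variance $V_T^k$, at a cost of a constant number of operations per expert.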

4 Component iProd: a Second-order Quantile Method for Combinatorial Games

In the combinatorial setting the elementary actions are combinatorial concepts from some class $\mathcal{C} \subseteq \{0,1\}^K$. The
combinatorial structure is reflected in the loss, which decomposes into a sum of coordinate losses. That is, the loss of
concept $c \in \mathcal{C}$ is $c^\top \ell$ for some loss vector $\ell \in [0,1]^K$. For example, the loss of a path is the total loss of its edges. We
allow the learner to play a distribution $p$ on $\mathcal{C}$ and incur the expected loss $\mathbb{E}_{p(c)}[c^\top \ell] = \mathbb{E}_{p(c)}[c]^\top \ell$. This means that
the loss of $p$ is determined by its mean, which is called the usage of $p$. We can therefore simplify the setup by having
the learner play a usage vector $u \in \mathcal{U}$, where $\mathcal{U} = \mathrm{conv}(\mathcal{C}) \subseteq [0,1]^K$ is the polytope of valid usages. The loss then
becomes $u^\top \ell$.
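The reduction from distributions to usages is just linearity of expectation, as the following toy sketch shows. The three concepts, the distribution `p` and the loss vector are made-up values chosen for illustration.

```python
# Three hypothetical concepts over K = 4 components, as 0/1 indicator vectors.
concepts = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1)]
p = [0.5, 0.3, 0.2]                 # distribution over the concepts
loss = [0.2, 0.9, 0.4, 0.1]         # loss vector in [0, 1]^K

# Usage: the mean of the concept vectors under p.
usage = [sum(pi * c[k] for pi, c in zip(p, concepts)) for k in range(4)]

# Expected loss of a randomly drawn concept ...
expected_loss = sum(pi * sum(ck * lk for ck, lk in zip(c, loss))
                    for pi, c in zip(p, concepts))
# ... coincides with the linear loss of the usage vector.
linear_loss = sum(uk * lk for uk, lk in zip(usage, loss))
```

Both quantities agree, so an algorithm only needs to track the $K$-dimensional usage rather than a distribution over the possibly exponentially many concepts.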

Koolen et al. [2010] point out that the Hedge algorithm with the concepts as experts guarantees

$$R_T^c \prec K \sqrt{T\, \mathrm{comp}(\mathcal{C})}, \qquad \text{Expanded Hedge} \quad (17)$$

upon proper tuning, where $\mathrm{comp}(\mathcal{C})$ is some appropriate notion of the complexity of the combinatorial class $\mathcal{C}$. (This is
exactly (1), where the additional factor $K$ comes from the fact that the loss of a single concept now ranges over $[0, K]$
instead of $[0, 1]$.) The computationally efficient Follow the Perturbed Leader strategy has the same bound. However,
Koolen et al. [2010] show that this bound has a fundamentally suboptimal dependence on the loss range, which they
call the range factor problem. Properly tuned, their Component Hedge algorithm (a particular instance of Mirror
Descent) keeps the regret below

$$R_T^c \prec \sqrt{T K\, \mathrm{comp}(\mathcal{C})}, \qquad \text{Component Hedge} \quad (18)$$

the improvement being due to the algorithm exploiting the sum structure of the loss. To show that this cannot
be improved further, Koolen et al. [2010] exhibit matching lower bounds for a variety of combinatorial domains.
Audibert et al. [2014] give an example where the upper bound (17) for Expanded Hedge is tight, so the range factor
problem cannot be solved by a better analysis.


In this section we aim to develop efficient algorithms for combinatorial prediction that obtain the second-order and
quantile improvements of (18), but do not suffer from the range factor problem.

It is instructive to see that our bounds (2) for Squint/iProd, when applied with a separate expert for each concept,
indeed also suffer from a suboptimal loss range dependence. We find

$$R_T^c \prec \sqrt{V_T^c \left(\mathrm{comp}(\mathcal{C}) + \text{tuning cost}\right)}, \qquad \text{Expanded Squint/iProd}$$

where $V_T^c = \sum_{t=1}^T (r_t^c)^2 = \sum_{t=1}^T \left(\sum_{k=1}^K r_t^{c,k}\right)^2$ with $r_t^{c,k} \in [-1, 1]$ may now be as large as $K^2 T$, whereas we know $KT$
suffices. The reason for this is that $V_T^c$ measures the variance of the concept as a whole, whereas the sum structure of
the loss makes it possible to replace $V_T^c$ by the sum of the variances of the components. In the analysis, this problem
shows up when we apply the bound (7). To fix it, we must therefore rearrange the algorithm to be able to apply (7)
once per component.


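Outlook Our approach will be based on a new potential function that aggregates over learning rates $\eta$ explicitly and
over the concept class $\mathcal{C}$ implicitly. Our inspiration for the latter comes from rewriting the factor featuring inside the
$\mathbb{E}_{\gamma(\eta)}$ expectation in the iProd potential (9) as

$$\mathbb{E}_{\pi(k)}\left[\prod_{t=1}^T (1 + \eta r_t^k)\right] = \prod_{t=1}^T \frac{\mathbb{E}_{\pi(k)}\left[\prod_{s=1}^{t} (1 + \eta r_s^k)\right]}{\mathbb{E}_{\pi(k)}\left[\prod_{s=1}^{t-1} (1 + \eta r_s^k)\right]} = e^{-\sum_{t=1}^T \left(-\ln \mathbb{E}_{p_t^\eta(k)}\left[e^{-x_t^{\eta,k}}\right]\right)}, \qquad (19)$$

which we interpret as the mix loss (see De Rooij et al., 2014) of the exponential weights distribution $p_t^\eta$ on auxiliary
losses:

$$-\ln \mathbb{E}_{p_t^\eta(k)}\left[e^{-x_t^{\eta,k}}\right], \quad \text{where} \quad p_t^\eta(k) = \frac{\pi(k)\, e^{-\sum_{s=1}^{t-1} x_s^{\eta,k}}}{\mathbb{E}_{\pi(k)}\left[e^{-\sum_{s=1}^{t-1} x_s^{\eta,k}}\right]} \quad \text{and} \quad x_t^{\eta,k} = -\ln(1 + \eta r_t^k).$$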


Thus, for each fixed $\eta$, we have identified a sub-module in which the loss is the mix loss. It turns out that the
Squint/iProd regret bounds can be reinterpreted as arising from (quantile) mix loss regret bounds for exponential
weights in this sub-module. For combinatorial games, we hence need to upgrade exponential weights to a
combinatorial algorithm for mix loss. No such algorithm was readily available, so we derive a new algorithm that we call
Component Bayes (a variant of Component Hedge) in Section 4.1 and prove a quantile mix loss regret bound for
it. Then in Section 4.2 we show that Component iProd, obtained by substituting Component Bayes for exponential
weights in the sub-module above, inherits all of iProd's desirable features. That is, by aggregating the above
sub-module over learning rates the Component iProd predictor delivers low second-order quantile regret. Component
iProd is summarized as Algorithm 1. Proofs can be found in Appendix A.

Algorithm 1 Component iProd. Required subroutines are the relative entropy projection step (3) and the
decomposition step (5). For polytopes $\mathcal{U}$ that can be represented by few linear inequalities these can be deferred to
general-purpose convex and linear optimizers. See Koolen et al. [2010] for more details, and for ideas regarding more efficient
implementations for example concept classes.

Input: Combinatorial class $\mathcal{C} \subseteq \{0,1\}^K$ with convex hull $\mathcal{U} = \mathrm{conv}(\mathcal{C})$
Input: Prior distribution $\gamma$ on a discrete grid $\mathcal{G} \subset [0, 1/2]$ and prior vector $\pi \in [0,1]^K$

 1: For each $\eta \in \mathcal{G}$, initialize $\tilde u_1^\eta = \pi$ and $L_1^\eta = -\ln(\gamma(\eta)\, \eta)$
 2: for $t = 1, 2, \ldots$ do
 3:   For each $\eta \in \mathcal{G}$, project $u_t^\eta = \arg\min_{u \in \mathcal{U}} \Delta_2(u \| \tilde u_t^\eta)$    ▷ (21a)
 4:   Compute usage $u_t = \sum_\eta e^{-L_t^\eta} u_t^\eta \big/ \sum_\eta e^{-L_t^\eta}$    ▷ (23)
 5:   Decompose $u_t = \sum_i p_{t,i}\, c_{t,i}$ into a convex combination of concepts $c_{t,i} \in \mathcal{C}$
 6:   Play $c_{t,i}$ with probability $p_{t,i}$
 7:   Receive loss vector $\ell_t$, incur expected loss $u_t^\top \ell_t$
 8:   For each $\eta \in \mathcal{G}$ and $k$, update $\tilde u_{t+1}^{\eta,k} = u_t^{\eta,k}\, \dfrac{1 + \eta (u_t^k - 1) \ell_t^k}{1 + \eta (u_t^k - u_t^{\eta,k}) \ell_t^k}$    ▷ (21b), (24), (22)
 9:   For each $\eta \in \mathcal{G}$, update $L_{t+1}^\eta = L_t^\eta - \sum_{k=1}^K \frac{1}{K} \ln\left(1 + \eta (u_t^k - u_t^{\eta,k}) \ell_t^k\right)$    ▷ (20), (24), (22)
10: end for


4.1 Component Bayes 

In this section we describe a combinatorial algorithm for mix loss, which will then be an essential subroutine in our
Component iProd algorithm. We take as our action space some closed convex $\mathcal{U} \subseteq [0,1]^K$. The game then proceeds
in rounds. Each round $t$ the learner plays $u_t \in \mathcal{U}$, which we interpret as making $K$ independent plays in $K$ parallel
two-expert sub-games, putting weight $u_t^k$ and $1 - u_t^k$ on experts 1 and 0 in sub-game $k$. The environment reveals a
loss vector $x_t \in \mathbb{R}^{K \times \{0,1\}}$ (we use $x$ for the loss in this auxiliary game, and reserve $\ell$ for the loss in the main game),
and the loss of the learner is the sum of per-coordinate mix losses:

$$\ell_{\mathrm{mix}}(u, x) = \sum_{k=1}^K -\ln\left(u^k e^{-x^{k,1}} + (1 - u^k) e^{-x^{k,0}}\right). \qquad (20)$$

The goal is to compete with the best element $v \in \mathcal{U}$. We define Component Bayes¹ inductively as follows. We set
$\tilde u_1 = \pi \in [0,1]^K$ to some prior vector of our choice (which does not have to be a usage in $\mathcal{U}$), and then alternate

$$u_t = \arg\min_{u \in \mathcal{U}} \Delta_2(u \| \tilde u_t), \qquad (21a)$$
$$\tilde u_{t+1} = \arg\min_{u \in [0,1]^K} \Delta_2(u \| u_t) + \sum_{k=1}^K \left(u^k x_t^{k,1} + (1 - u^k) x_t^{k,0}\right), \qquad (21b)$$

where $\Delta_2$ denotes the binary relative entropy, defined from scalars $x$ to $y$ and vectors $v$ to $u$ by

$$\Delta_2(x \| y) = x \ln\frac{x}{y} + (1 - x) \ln\frac{1 - x}{1 - y} \quad \text{and} \quad \Delta_2(v \| u) = \sum_{k=1}^K \Delta_2(v^k \| u^k).$$

¹ Ignoring a small technically convenient switch from generalized to binary relative entropy we find that Component Bayes equals Component
Hedge of Koolen et al. [2010]. The new name stresses an important distinction in the game protocol: Component Hedge guarantees low linear loss
regret, Component Bayes guarantees low mix loss regret.


This simple scheme is all it takes to adapt to the combinatorial domain.

Lemma 5. Fix any closed convex $\mathcal{U} \subseteq [0,1]^K$. For any loss sequence $x_1, \ldots, x_T$ in $\mathbb{R}^{K \times \{0,1\}}$, the mix loss regret of
Component Bayes (21) with prior $\pi \in [0,1]^K$ compared to any $v \in \mathcal{U}$ is at most

$$\sum_{t=1}^T \ell_{\mathrm{mix}}(u_t, x_t) - \sum_{t=1}^T \sum_{k=1}^K \left(v^k x_t^{k,1} + (1 - v^k) x_t^{k,0}\right) \le \Delta_2(v \| \pi).$$

The practicality of Component Bayes does depend on the computational cost of computing the binary relative
entropy projection onto the convex set $\mathcal{U}$. Fortunately, in many applications $\mathcal{U}$ has a compact representation by means
of a few linear inequalities; e.g. the Birkhoff polytope (permutations) and the Flow polytope (paths). See [Koolen et al.,
2010] for examples. Component Bayes may then be implemented using off-the-shelf convex optimization subroutines
like CVX.
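To see the mix loss game and Lemma 5 in action without a convex solver, the sketch below takes the simplest action space $\mathcal{U} = [0,1]^K$, an assumption made purely for illustration. In that case the projection step is the identity, and the remaining update is the coordinate-wise Bayes posterior $\tilde u^k \propto u^k e^{-x^{k,1}}$ against $(1 - u^k) e^{-x^{k,0}}$, which is the minimiser of the binary relative entropy plus expected loss. All names are illustrative; each `x[k]` is a pair `(x^{k,0}, x^{k,1})`.

```python
import math
import random

def mix_loss(u, x):
    """Sum over coordinates of -ln(u_k e^{-x[k][1]} + (1 - u_k) e^{-x[k][0]})."""
    return sum(-math.log(uk * math.exp(-xk[1]) + (1.0 - uk) * math.exp(-xk[0]))
               for uk, xk in zip(u, x))

def bayes_update(u, x):
    """Coordinate-wise Bayes posterior in each two-expert sub-game."""
    out = []
    for uk, xk in zip(u, x):
        a = uk * math.exp(-xk[1])
        b = (1.0 - uk) * math.exp(-xk[0])
        out.append(a / (a + b))
    return out

def binary_kl(v, u):
    """Binary relative entropy Delta_2(v || u), with the convention 0 ln 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return sum(term(vk, uk) + term(1.0 - vk, 1.0 - uk) for vk, uk in zip(v, u))

def mix_loss_regret(xs, prior, v):
    """Mix loss regret against comparator v when U = [0,1]^K (no projection needed)."""
    u = list(prior)
    learner = comparator = 0.0
    for x in xs:
        learner += mix_loss(u, x)
        comparator += sum(vk * xk[1] + (1.0 - vk) * xk[0] for vk, xk in zip(v, x))
        u = bayes_update(u, x)
    return learner - comparator

random.seed(3)
K, T = 3, 50
prior = [0.5, 0.2, 0.7]
xs = [[(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(K)]
      for _ in range(T)]
comparators = [(1.0, 0.0, 1.0), (0.3, 0.3, 0.9)]
regrets = [mix_loss_regret(xs, prior, v) for v in comparators]
```

For a general polytope $\mathcal{U}$ the projection onto $\mathcal{U}$ would have to be inserted before each play; that step is what requires a convex optimizer.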


4.2 Component iProd 


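We now return to our original problem of combinatorial prediction with linear loss. Using Component Bayes (which
is for mix loss) as a sub-module, we construct an algorithm with second-order quantile bounds. We first have to
extend our notion of regret vector.² Suppose the learner predicts usage $u_t \in \mathcal{U} \subseteq [0,1]^K$ and encounters loss vector
$\ell_t \in [-1, +1]^K$. We then define the regret vector $r_t \in [-1, 1]^{K \times \{0,1\}}$ by

$$r_t^{k,1} = u_t^k \ell_t^k - \ell_t^k \quad \text{and} \quad r_t^{k,0} = u_t^k \ell_t^k. \qquad (22)$$

Fix a prior vector $\pi \in [0,1]^K$ and prior distribution $\gamma$ on $[0, 1/2]$. We define the Component iProd potential function
and predictor by

$$\Phi(r_{1:T}) = \mathbb{E}_{\gamma(\eta)}\left[e^{-\frac{1}{K} \sum_{t=1}^T \ell_{\mathrm{mix}}(u_t^\eta, x_t^\eta)} - 1\right], \qquad u_{T+1} = \frac{\mathbb{E}_{\gamma(\eta)}\left[e^{-\frac{1}{K} \sum_{t=1}^T \ell_{\mathrm{mix}}(u_t^\eta, x_t^\eta)}\, \eta\, u_{T+1}^\eta\right]}{\mathbb{E}_{\gamma(\eta)}\left[e^{-\frac{1}{K} \sum_{t=1}^T \ell_{\mathrm{mix}}(u_t^\eta, x_t^\eta)}\, \eta\right]}, \qquad (23)$$

where $u_1^\eta, u_2^\eta, \ldots$ denote the usages of Component Bayes with prior $\pi$ on losses $x_1^\eta, x_2^\eta, \ldots$ set to³

$$x_t^{\eta,k,b} = -\ln\left(1 + \eta r_t^{k,b}\right). \qquad (24)$$

Note that $u_t \in \mathcal{U}$ is a bona fide action, as it is a convex combination of $u_t^\eta \in \mathcal{U}$. As can be seen from (19), this
potential generalizes the iProd (9) potential: in the base case $K = 1$ and $\mathcal{C} = \{0, 1\}$ Component iProd reduces to
iProd (9) on $K = 2$ experts if we set the loss for Component iProd to the difference of the losses for iProd. We will
now show that Component iProd has the desired regret guarantee.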

Lemma 6. Fix any closed convex $\mathcal{U} \subseteq [0,1]^K$. Component iProd (23) ensures that for any loss sequence $\ell_1, \ldots, \ell_T$
in $[-1, +1]^K$ we have $\Phi(r_{1:T}) \le \ldots \le \Phi(\emptyset) = 0$.


We now establish that non-positive potential implies our desired regret bound. We express our quantile bound in
terms of the $v$-weighted cumulative coordinate-wise regret and uncentered variance

$$R_T^v = \sum_{t=1}^T \sum_{k=1}^K \left(v^k r_t^{k,1} + (1 - v^k) r_t^{k,0}\right) = \sum_{t=1}^T (u_t - v)^\top \ell_t,$$

$$V_T^v = \sum_{t=1}^T \sum_{k=1}^K \left(v^k (r_t^{k,1})^2 + (1 - v^k) (r_t^{k,0})^2\right).$$

² Interestingly, the natural generalization of the expert regret vector $r_t^k = (u_t / K - e_k)^\top \ell_t$, which renormalizes the usage, does not result in
the desired result. To see this, consider a perfect scenario with a clearly best concept $c \in \mathcal{C}$ on which the learner fully concentrates its predictions
$u_t = c$. This should not result in any regret compared to $c$. But for $k \in c$ we still have $r_t^k \ne 0$ (unless all coordinates $k \in c$ suffer identical loss),
and so the variance may accumulate linearly.

³ We could alternatively set $x_t^{\eta,k,b} = \eta^2 (r_t^{k,b})^2 - \eta r_t^{k,b}$ and prove the same regret bound. But to get the tighter algorithm we delay the bound
to the analysis. See the discussion surrounding (9) about Squint vs iProd.
Lemma 7. Suppose $\gamma$ is supported on a discrete grid $\mathcal{G} \subset [0,1/2]$. Component iProd (23) guarantees that for every $\eta \in \mathcal{G}$ and for every comparator $v \in \mathcal{U}$ the regret satisfies

$$\eta R_T^v - \eta^2 V_T^v \le \Delta_2(v\|\pi) - K \ln \gamma(\eta).$$

We now discuss the choice of the discrete prior $\gamma$ on $\eta$. Here we face a trade-off between regret and computation. More discretization points reduce the regret overhead for mis-tuning, but since we need to run one instance of Component Bayes per grid point, the computation time also grows linearly in the number of grid points. Fortunately, Lemma 7 implies that exponential spacing suffices, as missing the optimal tuning $\hat\eta = \frac{R_T^v}{2 V_T^v}$ by a constant factor affects the regret bound by another constant factor. To see this, apply Lemma 7 to $\eta = \alpha\hat\eta$. Since $\alpha\hat\eta R_T^v - \alpha^2\hat\eta^2 V_T^v = \alpha(2-\alpha)\frac{(R_T^v)^2}{4 V_T^v}$, we find

$$R_T^v \le \frac{2}{\sqrt{\alpha(2-\alpha)}} \sqrt{V_T^v \left( \Delta_2(v\|\pi) - K \ln \gamma(\alpha\hat\eta) \right)}. \tag{25}$$

It is therefore sufficient to choose $\eta$ from an exponentially spaced grid $\mathcal{G}$. In particular, we propose to let $\gamma$ be the uniform distribution on

$$\mathcal{G} = \left\{ 2^{-i} \mid i = 1, \ldots, \lceil 1 + \log_2 T \rceil \right\}. \tag{26}$$

This leads to the following final regret bound: 

Theorem 8. Let $\mathcal{U} \subseteq [0,1]^K$ be closed and convex. Component iProd (23), with $\gamma$ the uniform prior on grid $\mathcal{G}$ from (26) and arbitrary $\pi \in [0,1]^K$, ensures that, for any sequence $\ell_1, \ldots, \ell_T$ of $[-1,+1]^K$-valued loss vectors, the regret compared to any $v \in \mathcal{U}$ is at most

$$R_T^v \le \frac{4}{\sqrt{3}} \sqrt{V_T^v \left( \Delta_2(v\|\pi) + K \ln\lceil 1 + \log_2 T \rceil \right)} + 4\Delta_2(v\|\pi) + K \max\{4 \ln\lceil 1 + \log_2 T \rceil, 1\}. \tag{27}$$
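
As an illustration of the grid (26), the following small sketch builds the grid and checks the property that the analysis relies on: every $\eta \in [\frac{1}{2T}, \frac12]$ has a grid point $\alpha\eta$ with $\alpha \in [\frac12, 1]$, so the mis-tuning factor $1/\sqrt{\alpha(2-\alpha)}$ in (25) is at most $2/\sqrt{3}$. The helper names `eta_grid`, `grid_point_below` and `overhead` are hypothetical.

```python
import math

def eta_grid(T):
    """Learning rates 2^-1, ..., 2^-ceil(1 + log2 T), as in (26)."""
    n = math.ceil(1 + math.log2(T))
    return [2.0 ** -i for i in range(1, n + 1)]

def grid_point_below(eta, grid):
    """Largest grid point not exceeding eta (assumes eta >= min(grid))."""
    return max(g for g in grid if g <= eta)

def overhead(alpha):
    """Factor 1 / sqrt(alpha * (2 - alpha)) paid in (25) for using alpha * eta_hat."""
    return 1.0 / math.sqrt(alpha * (2.0 - alpha))

T = 1000
grid = eta_grid(T)
for eta in [1 / (2 * T), 0.001, 0.01, 0.3, 0.5]:
    alpha = grid_point_below(eta, grid) / eta
    print(eta, alpha, overhead(alpha))
```

The grid has only $\lceil 1 + \log_2 T \rceil$ points, so the number of Component Bayes instances grows logarithmically in $T$.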


Discussion of Component iProd We showed that if we have an algorithm for keeping the mix loss regret small compared to some concept class, we can run multiple instances, each with a different learning rate factored into the losses, and as a result also keep the linear loss regret small with second-order quantile bounds. Another setting where this could be applied is that of switching experts. The Fixed Share algorithm by Herbster and Warmuth [1998] applies to all Vovk mixable losses, so in particular to the mix loss, and delivers adaptive regret bounds [Adamskiy et al., 2012]. Aggregating over $\log_2 T$ exponentially spaced $\eta$ to learn the learning rate would indeed be very cheap. Yet another setting is matrix-valued prediction under linear loss [Tsuda et al., 2005], where our method would transport the mix loss bounds of Warmuth and Kuzmin [2010] to second-order quantile bounds.

In Lemma 7 we see that the cost $-\ln \gamma(\eta)$ for learning the learning rate $\eta$ occurs multiplied by the ambient dimension $K$. Intuitively this seems wasteful, as we are not trying to learn a separate rate for each component. But we could not reduce $K$ to 1. For example, defining the potential (23) without the division by $K$ escalates its dependency on the loss $\ell$ from linear to polynomial of order $K$. Unfortunately this potential cannot be kept below zero even for $K = 2$.


5 Conclusion and Future Work 


We have constructed second-order quantile methods for both the expert setting (Squint) and for general combinatorial games (Component iProd). The key in both cases is the ability to learn the appropriate learning rate, which is reflected by the integrals over $\eta$ in the potential functions of Squint and of Component iProd (23). As discussed earlier, there is a whole variety of different ways to adapt to the optimal $\eta$. This raises the question of whether there is a unifying perspective that explains when and how it is possible to learn the learning rate in general.

Another issue for future work is to find matching lower bounds. Although lower bounds in terms of $\sqrt{T \ln K}$ are available for the worst possible sequence [Cesa-Bianchi and Lugosi, 2006], the issue is substantially more complex when considering either variances or quantiles. We are not aware of any lower bounds in terms of the variance $V_T^k$. Gofer and Mansour [2012] provide lower bounds that hold for any sequence, in terms of the squared loss ranges in each round, but these do not apply to methods that adaptively tune their learning rate. For quantile bounds, Koolen [2013] takes a first step by characterizing the Pareto optimal quantile bounds for 2 experts in the $\sqrt{T}$ regime.

Finally, we have assumed throughout that all losses are normalized to the range [0,1]. But there exist second-order methods that do not require this normalization and can adapt automatically to the loss range [Cesa-Bianchi et al., 2007, De Rooij et al., 2014, Wintenberger, 2014]. It is an open question how such adaptive techniques can be incorporated elegantly into our methods.
A Proofs

This section collects the proofs omitted from Sections 3 and 4.


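A.1 Theorem 2

Proof. Abbreviate $R = R_T^{\mathcal{K}} + a$ and $V = V_T^{\mathcal{K}} + b$. Then from Lemma 1 and Jensen's inequality we obtain

$$1 \ge \mathbb{E}_{\pi(k)\gamma(\eta)}\left[ e^{\eta R_T^k - \eta^2 V_T^k} \right] \ge \frac{\pi(\mathcal{K})\, \mathbb{E}_{\pi(k|\mathcal{K})} \int_0^{1/2} e^{\eta (R_T^k + a) - \eta^2 (V_T^k + b)}\, d\eta}{Z(a,b)} \ge \frac{\pi(\mathcal{K}) \int_0^{1/2} e^{\eta R - \eta^2 V}\, d\eta}{Z(a,b)}.$$

The $\eta$ that maximizes $\eta R - \eta^2 V$ is $\hat\eta = \frac{R}{2V}$. Without loss of generality, we can assume that $\hat\eta \ge \frac{1}{\sqrt{2V}} > 0$, because otherwise $R < \sqrt{2V}$, from which (12) follows directly. Now let $[u, v] \subseteq [0, \frac12]$ be any interval such that $v \le \hat\eta$. Then, because $\eta R - \eta^2 V$ is non-decreasing in $\eta$ for $\eta \le \hat\eta$, we have

$$\int_0^{1/2} e^{\eta R - \eta^2 V}\, d\eta \ge \int_u^v e^{\eta R - \eta^2 V}\, d\eta \ge (v - u)\, e^{uR - u^2 V},$$

so that the above two equations imply

$$uR - u^2 V \le \ln\left( \frac{Z(a,b)}{\pi(\mathcal{K})(v - u)} \right). \tag{28}$$

Suppose first that $\hat\eta \le 1/2$. Then we take $v = \hat\eta$ and $u = \hat\eta - \frac{1}{\sqrt{2V}}$, for which $uR - u^2 V = \frac{R^2}{4V} - \frac12$ and $v - u = \frac{1}{\sqrt{2V}}$. Plugging these into (28) we obtain

$$R \le 2\sqrt{V \left( \frac12 + \ln\left( \frac{Z(a,b)\sqrt{2V}}{\pi(\mathcal{K})} \right) \right)},$$

which implies (12). Alternatively, we may have $\hat\eta > 1/2$, which is equivalent to $R > V$. Then the left-hand side of (28) is at least $u(1 - u)R$ and hence we obtain

$$R \le \frac{1}{(1 - u)u} \ln\left( \frac{Z(a,b)}{\pi(\mathcal{K})(v - u)} \right).$$

Taking $u = \frac{5 - \sqrt5}{10}$ and $v = 1/2$, for which $u(1-u) = \frac15$ and $v - u = \frac{1}{2\sqrt5}$, then leads to the bound

$$R \le 5 \ln\left( \frac{2\sqrt5\, Z(a,b)}{\pi(\mathcal{K})} \right),$$

which again implies (12). □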


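A.2 Theorem 3

Proof. Abbreviate $R = R_T^{\mathcal{K}}$ and $V = V_T^{\mathcal{K}}$, and let $\hat\eta = \frac{R}{2V}$ be the $\eta$ that maximizes $\eta R - \eta^2 V$. Then $\eta R - \eta^2 V$ is non-decreasing in $\eta$ for $\eta \le \hat\eta$ and hence, for any interval $[u, v] \subseteq [0, 1/2]$ such that $v \le \hat\eta$, we obtain from (7) and Jensen's inequality that

$$1 \ge \pi(\mathcal{K})\, \mathbb{E}_{\pi(k|\mathcal{K})\gamma(\eta)}\left[ e^{\eta R_T^k - \eta^2 V_T^k} \right] \ge \pi(\mathcal{K})\, \mathbb{E}_{\gamma(\eta)}\left[ e^{\eta R - \eta^2 V} \right] \ge \pi(\mathcal{K})\, \gamma([u, v])\, e^{uR - u^2 V}, \tag{29}$$

where

$$\gamma([u, v]) = \int_u^v \frac{\ln(2)}{\eta \ln^2(\eta)}\, d\eta = \frac{\ln(2) \ln\left(\frac{v}{u}\right)}{\ln\left(\frac1u\right) \ln\left(\frac1v\right)}. \tag{30}$$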
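If $R \le 2\sqrt{V}$, then (14) follows by considering the cases $V \le 4$ and $V \ge 4$, so suppose that $R > 2\sqrt{V}$, which implies $\hat\eta > \frac{1}{\sqrt{V}}$. Now suppose first that $\hat\eta \le 1/2$. Then we take $v = \hat\eta$ and $u = \hat\eta - \frac{1}{\sqrt{2V}} > 0$, for which

$$\gamma([u, v])\, e^{uR - u^2 V} = \frac{\ln(2)\, e^{\frac{R^2}{4V} - \frac12} \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right)}{\ln\left( \frac{2V}{R} \right) \ln\left( \frac{2V}{R - \sqrt{2V}} \right)} \ge \frac{\ln(2)\, e^{\frac{R^2}{4V} - \frac12} \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right)}{\ln^2\left( \frac{2\sqrt{V}}{2 - \sqrt2} \right)},$$

where the inequality follows from $R > 2\sqrt{V}$. By $e^{\frac12 (x^2 - 1)} = e^{\frac12 (x - 1)^2} e^{x - 1} \ge e^{\frac12 (x - 1)^2} x$ and $\ln\frac{1}{1 - y} \ge y$, applied with $x = \frac{R}{\sqrt{2V}}$ and $y = \frac{\sqrt{2V}}{R}$, we can lower bound the numerator with

$$e^{\frac{R^2}{4V} - \frac12} \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right) \ge e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2} \frac{R}{\sqrt{2V}} \cdot \frac{\sqrt{2V}}{R} = e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2}.$$

Putting everything together, we obtain

$$1 \ge \frac{\pi(\mathcal{K}) \ln(2)\, e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2}}{\ln^2\left( \frac{2\sqrt{V}}{2 - \sqrt2} \right)},$$

which implies

$$R \le \sqrt{2V} \left( 1 + \sqrt{2 \ln\left( \frac{\ln^2\left( \frac{2\sqrt{V}}{2 - \sqrt2} \right)}{\pi(\mathcal{K}) \ln(2)} \right)} \right),$$

and (14) is satisfied.

It remains to consider the case $\hat\eta > 1/2$, which implies $R > V$. Then we take $v = 1/2$, and combining (29) with (30) leads to

$$uR - u^2 V \le -\ln \pi(\mathcal{K}) - \ln\left( 1 - \frac{\ln(2)}{\ln\left(\frac1u\right)} \right).$$

Using $R > V$, the left-hand side is at least $u(1 - u)R$. The choice $u = \frac{5 - \sqrt5}{10}$ then again implies (14), which completes the proof. □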


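A.3 Theorem 4

Proof. The proof of Lemma 1 goes through unchanged for the improper prior, but we have to be careful, because we cannot pull out the constant 1 from the integral over $\eta$ in the potential function any more. So abbreviate $R = R_T^{\mathcal{K}}$ and $V = V_T^{\mathcal{K}}$. Then, by Lemma 1, $R_T^k \ge -T$, $V_T^k \le T$, and Jensen's inequality,

$$\begin{aligned}
0 \ge \Phi(R_{1:T}) &= \mathbb{E}_{\pi(k)}\left[ \int_0^{1/2} \frac{e^{\eta R_T^k - \eta^2 V_T^k} - 1}{\eta}\, d\eta \right] \\
&\ge \pi(\mathcal{K})\, \mathbb{E}_{\pi(k|\mathcal{K})}\left[ \int_0^{1/2} \frac{e^{\eta R_T^k - \eta^2 V_T^k} - 1}{\eta}\, d\eta \right] + (1 - \pi(\mathcal{K})) \int_0^{1/2} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta \\
&\ge \pi(\mathcal{K}) \int_0^{1/2} \frac{e^{\eta R - \eta^2 V} - 1}{\eta}\, d\eta + (1 - \pi(\mathcal{K})) \int_0^{1/2} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta.
\end{aligned}$$

Now first for the bad experts that are not in $\mathcal{K}$, we will show that

$$\int_0^{1/2} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta \ge -\frac12 - \ln(T + 1). \tag{31}$$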


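Let $\epsilon \in [0, 1/2]$ be arbitrary. Then, using $e^x \ge 1 + x$ and $e^x \ge 0$, we obtain

$$\begin{aligned}
\int_0^{1/2} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta &= \int_0^{\epsilon} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta + \int_{\epsilon}^{1/2} \frac{e^{-\eta T - \eta^2 T} - 1}{\eta}\, d\eta \\
&\ge \int_0^{\epsilon} \frac{-\eta T - \eta^2 T}{\eta}\, d\eta - \int_{\epsilon}^{1/2} \frac{1}{\eta}\, d\eta = -\epsilon T - \frac{\epsilon^2}{2} T + \ln(2\epsilon).
\end{aligned}$$

The choice $\epsilon = \frac{1}{2(T+1)}$ implies (31) for all $T \ge 0$.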

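Second, for the good experts that are in $\mathcal{K}$, we proceed as follows. Let $\hat\eta = \frac{R}{2V}$ be the $\eta$ that maximizes $\eta R - \eta^2 V$. Then $\eta R - \eta^2 V$ is non-decreasing in $\eta$ for $\eta \le \hat\eta$ and hence, for any interval $[u, v] \subseteq [0, 1/2]$ such that $v \le \hat\eta$,

$$\begin{aligned}
\int_0^{1/2} \frac{e^{\eta R - \eta^2 V} - 1}{\eta}\, d\eta &\ge \int_0^u \frac{e^{0 R - 0 V} - 1}{\eta}\, d\eta + \left( e^{uR - u^2 V} - 1 \right) \int_u^v \frac{1}{\eta}\, d\eta - \int_v^{1/2} \frac{1}{\eta}\, d\eta \\
&= \left( e^{uR - u^2 V} - 1 \right) \ln\frac{v}{u} + \ln(2v).
\end{aligned} \tag{32}$$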


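We may assume without loss of generality that $R \ge 2\sqrt{V}$ (otherwise (16) follows directly), which implies that $\hat\eta \ge \frac{1}{\sqrt{V}}$.

We now have two cases. Suppose first that $\hat\eta \le 1/2$. Then we plug in $v = \hat\eta$ and $u = \hat\eta - \frac{1}{\sqrt{2V}}$ and use $R \ge 2\sqrt{V}$ to find that

$$\begin{aligned}
\int_0^{1/2} \frac{e^{\eta R - \eta^2 V} - 1}{\eta}\, d\eta &\ge \left( e^{\frac{R^2}{4V} - \frac12} - 1 \right) \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right) + \ln\left( \frac{R}{V} \right) \\
&\ge \left( e^{\frac{R^2}{4V} - \frac12} - 1 \right) \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right) - \frac12 \ln\left( \frac{V}{4} \right).
\end{aligned}$$

Using $e^{\frac12 (x^2 - 1)} = e^{\frac12 (x - 1)^2} e^{x - 1} \ge e^{\frac12 (x - 1)^2} x$ and $-1 \ge -x$ for $x = \frac{R}{\sqrt{2V}} \ge 1$, together with $\ln\frac{1}{1 - y} \ge y$ for $y = \frac{\sqrt{2V}}{R}$, we find

$$\left( e^{\frac{R^2}{4V} - \frac12} - 1 \right) \ln\left( \frac{1}{1 - \sqrt{2V}/R} \right) \ge \left( e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2} \frac{R}{\sqrt{2V}} - \frac{R}{\sqrt{2V}} \right) \frac{\sqrt{2V}}{R} = e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2} - 1.$$

Putting everything together and using $V \le T$ together with $1 + \frac12 \ln\frac{T}{4} \le \frac12 + \ln(T + 1)$ for $T \ge 1$, we get

$$\begin{aligned}
0 &\ge \pi(\mathcal{K}) \left( e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2} - 1 - \frac12 \ln\frac{V}{4} \right) - (1 - \pi(\mathcal{K})) \left( \frac12 + \ln(T + 1) \right) \\
&\ge \pi(\mathcal{K})\, e^{\frac12 \left( \frac{R}{\sqrt{2V}} - 1 \right)^2} - \left( \frac12 + \ln(T + 1) \right),
\end{aligned}$$

which implies

$$R \le \sqrt{2V} \left( 1 + \sqrt{2 \ln\left( \frac{\frac12 + \ln(T + 1)}{\pi(\mathcal{K})} \right)} \right),$$

and (16) follows.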

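It remains to consider the case that $\hat\eta > 1/2$, which implies $R > V$. We then use $v = 1/2$, for which (32) leads to

$$\int_0^{1/2} \frac{e^{\eta R - \eta^2 V} - 1}{\eta}\, d\eta \ge \left( e^{uR - u^2 V} - 1 \right) \ln\frac{1}{2u} \ge \left( e^{u(1 - u)R} - 1 \right) \ln\frac{1}{2u}.$$

Putting everything together then gives

$$R \le \frac{1}{u(1 - u)} \ln\left( 1 + \frac{(1 - \pi(\mathcal{K})) \left( \frac12 + \ln(T + 1) \right)}{-\ln(2u)\, \pi(\mathcal{K})} \right) \le \frac{1}{u(1 - u)} \ln\left( 1 + \frac{\frac12 + \ln(T + 1)}{-\ln(2u)\, \pi(\mathcal{K})} \right),$$

and (16) follows by plugging in $u = \frac{5 - \sqrt5}{10}$. □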
A.4 Lemma 5

Proof. Note that (21b) is minimized at the independent component-wise posteriors

$$\tilde u_{t+1}^k = \frac{u_t^k e^{-x_t^{k,1}}}{u_t^k e^{-x_t^{k,1}} + (1 - u_t^k) e^{-x_t^{k,0}}}. \tag{33}$$

The instantaneous mix loss regret in coordinate $k$ in round $t$ hence equals

$$\begin{aligned}
-\ln\left( u_t^k e^{-x_t^{k,1}} + (1 - u_t^k) e^{-x_t^{k,0}} \right) - v^k x_t^{k,1} - (1 - v^k) x_t^{k,0}
&= v^k \ln\frac{\tilde u_{t+1}^k}{u_t^k} + (1 - v^k) \ln\frac{1 - \tilde u_{t+1}^k}{1 - u_t^k} \\
&= \Delta_2(v^k \| u_t^k) - \Delta_2(v^k \| \tilde u_{t+1}^k),
\end{aligned}$$

and we can write the cumulative regret as

$$\sum_{t=1}^T \sum_{k=1}^K \left( \Delta_2(v^k \| u_t^k) - \Delta_2(v^k \| \tilde u_{t+1}^k) \right) = \sum_{t=1}^T \left( \Delta_2(v \| u_t) - \Delta_2(v \| \tilde u_{t+1}) \right). \tag{34}$$

As $\Delta_2$ is a Bregman divergence (for convex generator $F(x) = \sum_k x_k \ln x_k + (1 - x_k) \ln(1 - x_k)$), it is non-negative and satisfies the generalized Pythagorean inequality for Bregman divergences [Cesa-Bianchi and Lugosi, 2006, Lemma 11.3]. Since $v \in \mathcal{U}$ and $u_{t+1}$ is the projection of $\tilde u_{t+1}$ onto $\mathcal{U}$, these properties together imply that

$$\Delta_2(v \| u_{t+1}) \le \Delta_2(v \| u_{t+1}) + \Delta_2(u_{t+1} \| \tilde u_{t+1}) \le \Delta_2(v \| \tilde u_{t+1}).$$

Hence the cumulative mix loss regret satisfies

$$(34) \le \sum_{t=1}^T \left( \Delta_2(v \| u_t) - \Delta_2(v \| u_{t+1}) \right) = \Delta_2(v \| u_1) - \Delta_2(v \| u_{T+1}) \le \Delta_2(v \| \pi),$$

as required. □
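
The per-coordinate identity behind the telescoping in this proof can be checked numerically. The sketch below works with a single coordinate and ignores the projection step. The helper names `binary_kl`, `posterior` and `mix_loss_regret` are hypothetical.

```python
import math

def binary_kl(v, u):
    """Binary relative entropy Delta_2(v || u) for one coordinate, with 0 < v, u < 1."""
    return v * math.log(v / u) + (1 - v) * math.log((1 - v) / (1 - u))

def posterior(u, x1, x0):
    """Component-wise posterior (33) for one coordinate."""
    a = u * math.exp(-x1)
    b = (1 - u) * math.exp(-x0)
    return a / (a + b)

def mix_loss_regret(u, v, x1, x0):
    """Instantaneous mix loss regret of usage u against comparator v in one coordinate."""
    mix = -math.log(u * math.exp(-x1) + (1 - u) * math.exp(-x0))
    return mix - v * x1 - (1 - v) * x0

u, v, x1, x0 = 0.3, 0.8, 0.4, -0.2
lhs = mix_loss_regret(u, v, x1, x0)
rhs = binary_kl(v, u) - binary_kl(v, posterior(u, x1, x0))
print(abs(lhs - rhs) < 1e-12)
```

Summing this identity over rounds is what produces the telescoping sum, with the projection only helping by the generalized Pythagorean inequality.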

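A.5 Lemma 6

Proof. First, observe that, for any $\eta$, by the definition of the mix loss, Jensen's inequality and (24),

$$\begin{aligned}
e^{-\frac{1}{K} \ell_{\text{mix}}(u_t^\eta, x_t^\eta)} &\le \frac{1}{K} \sum_{k=1}^K \left( u_t^{\eta,k} e^{-x_t^{\eta,k,1}} + (1 - u_t^{\eta,k}) e^{-x_t^{\eta,k,0}} \right) \\
&= \frac{1}{K} \sum_{k=1}^K \left( u_t^{\eta,k} (1 + \eta r_t^{k,1}) + (1 - u_t^{\eta,k}) (1 + \eta r_t^{k,0}) \right) \\
&= 1 + \frac{\eta}{K} \sum_{k=1}^K (u_t^k - u_t^{\eta,k})\, \ell_t^k.
\end{aligned}$$

We hence have

$$\begin{aligned}
\Phi(r_{1:T+1}) - \Phi(r_{1:T}) &= \mathbb{E}_{\gamma(\eta)}\left[ e^{-\frac{1}{K} \sum_{t=1}^T \ell_{\text{mix}}(u_t^\eta, x_t^\eta)} \left( e^{-\frac{1}{K} \ell_{\text{mix}}(u_{T+1}^\eta, x_{T+1}^\eta)} - 1 \right) \right] \\
&\le \mathbb{E}_{\gamma(\eta)}\left[ e^{-\frac{1}{K} \sum_{t=1}^T \ell_{\text{mix}}(u_t^\eta, x_t^\eta)}\, \frac{\eta}{K} (u_{T+1} - u_{T+1}^\eta)^\top \ell_{T+1} \right] = 0,
\end{aligned}$$

where the last equality is by design of the weights (23). □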
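A.6 Lemma 7

Proof. Lemma 6 tells us that Component iProd ensures $\Phi(r_{1:T}) \le 0$. For any $\eta$, this implies

$$\begin{aligned}
-K \ln \gamma(\eta) &\ge -\sum_{t=1}^T \ell_{\text{mix}}(u_t^\eta, x_t^\eta) \\
&\ge -\sum_{t=1}^T \sum_{k=1}^K \left( v^k x_t^{\eta,k,1} + (1 - v^k) x_t^{\eta,k,0} \right) - \Delta_2(v \| \pi) \\
&\ge -\sum_{t=1}^T \sum_{k=1}^K \left( v^k \left( \eta^2 (r_t^{k,1})^2 - \eta r_t^{k,1} \right) + (1 - v^k) \left( \eta^2 (r_t^{k,0})^2 - \eta r_t^{k,0} \right) \right) - \Delta_2(v \| \pi) \\
&= \eta R_T^v - \eta^2 V_T^v - \Delta_2(v \| \pi),
\end{aligned}$$

where the first inequality is by Lemma 6, the second by Lemma 5, and the third by (24) with the bound $-\ln(1 + x) \le x^2 - x$ for $x \ge -1/2$. The result follows. □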

A.7 Theorem 8

Proof. The exponentially spaced grid of learning rates ensures that, for any $\eta \in [\frac{1}{2T}, \frac12]$, there always exists an $\alpha \in [\frac12, 1]$ for which $\alpha\eta$ is a grid point. Hence, whenever $\hat\eta = \frac{R_T^v}{2 V_T^v} \in [\frac{1}{2T}, \frac12]$, (25) implies that

$$R_T^v \le \frac{4}{\sqrt3} \sqrt{V_T^v \left( \Delta_2(v \| \pi) + K \ln\lceil 1 + \log_2 T \rceil \right)},$$

and (27) is satisfied. Alternatively, if $\hat\eta < \frac{1}{2T}$, then $R_T^v < V_T^v / T \le K$, and (27) again holds. Finally, suppose that $\hat\eta > 1/2$. Then $R_T^v > V_T^v$, and plugging $\eta = 1/2$ into Lemma 7 results in

$$\frac12 R_T^v - \frac14 V_T^v \le \Delta_2(v \| \pi) + K \ln\lceil 1 + \log_2 T \rceil.$$

Using that $R_T^v > V_T^v$, the left-hand side is at least $\frac14 R_T^v$, from which (27) follows. □


B Numerical stability


Although we are not numerical specialists, it is clear that some care should be taken evaluating the weight expressions for the conjugate prior and the improper prior. Initially, and as long as $V = 0$, we have $R = 0$ and hence Squint sets the weights $w$ equal to the prior $\pi$. We now assume $V > 0$, and look at these two weight expressions. Both involve a contribution of the form

$$\frac{\sqrt{\pi}\, e^{\frac{R^2}{4V}} \left( \operatorname{erf}\left( \frac{R}{2\sqrt{V}} \right) - \operatorname{erf}\left( \frac{R - V}{2\sqrt{V}} \right) \right)}{2\sqrt{V}}. \tag{35}$$

This expression is empirically numerically stable unless both erf arguments fall outside $[-6, 6]$ to the same side. In other words, it can be used when

$$-6 \le \frac{R}{2\sqrt{V}} \quad\text{and}\quad \frac{R - V}{2\sqrt{V}} \le 6, \quad\text{that is}\quad R \in \left[ -12\sqrt{V},\; V + 12\sqrt{V} \right]. \tag{36}$$

If we are not in this range, then we are feeding extreme arguments into both erfs. Hence we may Taylor expand (35) around $R = \pm\infty$ (both of which give the same result) to get

$$\frac{e^{\frac{R}{2} - \frac{V}{4}} - 1}{R} \quad\text{(0th and 1st order)} \qquad\text{or}\qquad \frac{e^{\frac{R}{2} - \frac{V}{4}} (R + V) - R}{R^2} \quad\text{(2nd order)}.$$

Note that this 0th order expansion is negative for $R \in [0, V/2]$, but that falls well within the stable range (36) where we should use (35) directly.
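
The switching rule above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it relies on the calculus identity that the erf expression equals $\int_0^{1/2} e^{\eta R - \eta^2 V}\, d\eta$, it uses the 2nd order expansion outside the stable range, and it does not guard against overflow of the exponentials for very large $R$ or $V$, for which a log-domain variant would be needed. The function name `erf_contribution` is hypothetical.

```python
import math

def erf_contribution(R, V):
    """Evaluate the erf-based contribution for V > 0.

    Inside the stable range R in [-12 sqrt(V), V + 12 sqrt(V)] the closed form
    is used directly. Outside it, the 2nd order expansion around R = +/- infinity
    is used instead.
    """
    root = math.sqrt(V)
    if -12.0 * root <= R <= V + 12.0 * root:
        s = 2.0 * root
        return (math.sqrt(math.pi) * math.exp(R * R / (4.0 * V))
                * (math.erf(R / s) - math.erf((R - V) / s)) / s)
    return (math.exp(R / 2.0 - V / 4.0) * (R + V) - R) / (R * R)

print(erf_contribution(3.0, 10.0))      # stable range: closed form
print(erf_contribution(-1000.0, 100.0)) # extreme range: expansion
```

The accuracy of the expansion depends on how far $R$ lies outside the stable range, so it should be read as an approximation rather than an exact replacement.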